While my previous posts outlined some methods for conducting EDA for numeric data as well as categorical data, this post focuses on EDA for images.

What is Exploratory Data Analysis (EDA)?

Again, since all learning is repetition, EDA is a process by which we 'get to know' our data by computing basic descriptive statistics and creating visualizations.

Why is it done for images?

We need to know:

  • how many images we have
  • if we're doing supervised learning, whether they are labeled appropriately
  • their format (i.e. size and color)

How do we do it in Python?

As always, I'll follow the steps outlined in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow.

Step 1: Frame the Problem

"Is it possible to determine the minimum age a reader should be for a given book based solely on the cover?"

Step 2: Get the Data

As mentioned in my previous posts, I sourced labeled training data from Common Sense Media's Book Reviews by scraping and saving the target pages using BeautifulSoup, and then extracted and saved the book covers into a separate folder.

In the end, I was able to use over 5000 covers for training and testing purposes, but today we'll only work with a sample of the covers which can be downloaded from here.

Step 3: Explore the Data to Gain Insights (i.e. EDA)

As always, import the essential libraries, then load the data.

import pandas as pd
import numpy as np
import os
import cv2
from PIL import Image
import ipyplot

How large is our sample?

IMAGES_PATH = "data/covers/"
image_files = list(os.listdir(IMAGES_PATH))
full_file_paths = [IMAGES_PATH+image for image in image_files] 
print("Number of image files: {}".format(len(image_files)))
Number of image files: 561
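Since os.listdir returns everything in the folder, it can be worth checking that every entry really is an image before trusting the count. Here is a minimal sketch, assuming the covers only use the .jpg, .jpeg and .png extensions; the helper names and the extension set are my own, and the ".DS_Store" entry is made up to show a stray file:

```python
import os
from collections import Counter

# Assumed set of image extensions; adjust to match your own folder.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def count_by_extension(file_names):
    """Count file names per lowercase extension."""
    return Counter(os.path.splitext(name)[1].lower() for name in file_names)

def split_images(file_names):
    """Separate names with an image extension from everything else."""
    images, others = [], []
    for name in file_names:
        ext = os.path.splitext(name)[1].lower()
        (images if ext in IMAGE_EXTENSIONS else others).append(name)
    return images, others

# Small listing standing in for os.listdir(IMAGES_PATH); ".DS_Store" is a made-up stray file.
example_files = ["13_gem-dixie.png", "3_fetch.jpeg", "8_moonpenny-island.jpg", ".DS_Store"]
extension_counts = count_by_extension(example_files)
images, others = split_images(example_files)
print(extension_counts)
print("Non-image entries:", others)
```

If the list of non-image entries is not empty, those files should probably be filtered out before building the data frame.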

What does our target look like?

To answer that question, we can create a data frame of the book titles and the target ages in our sample, and then plot the target.

Since I scraped the data, I know the beginning of the file name is the target age (e.g., 13 is the minimum age for the file '13_dance-of-thieves-book-1.jpg'), so we can create a data frame of the:

  1. file names
  2. full paths
  3. target column called age by splitting the file name on the underscore and extracting the first element
data = {'files':image_files, 'full_path':full_file_paths}
df = pd.DataFrame(data=data)
df['age'] = df['files'].str.split("_").str[0].astype('int')
df.head()
   files                                              full_path                                          age
0  13_dance-of-thieves-book-1.jpg                     data/covers/13_dance-of-thieves-book-1.jpg         13
1  11_ways-to-live-forever.jpg                        data/covers/11_ways-to-live-forever.jpg            11
2  13_this-time-will-be-different.jpg                 data/covers/13_this-time-will-be-different.jpg     13
3  10_the-care-and-keeping-of-you-2-the-body-book...  data/covers/10_the-care-and-keeping-of-you-2-t...  10
4  8_moonpenny-island.jpg                             data/covers/8_moonpenny-island.jpg                 8
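The astype('int') call above raises an error if any file name does not start with a number, so a quick check of the naming pattern can save some debugging. Here is a rough sketch, assuming every file follows an age, underscore, title pattern; the parse_age helper is hypothetical and "cover.jpg" is a made-up bad name:

```python
import re

# Assumed naming pattern: leading digits, an underscore, then the rest of the name.
AGE_PATTERN = re.compile(r"^(\d+)_.+")

def parse_age(file_name):
    """Return the leading age as an int, or None if the name does not match."""
    match = AGE_PATTERN.match(file_name)
    return int(match.group(1)) if match else None

# "cover.jpg" is a made-up name that breaks the pattern.
example_files = ["13_dance-of-thieves-book-1.jpg", "8_moonpenny-island.jpg", "cover.jpg"]
ages = [parse_age(name) for name in example_files]
unparsed = [name for name, age in zip(example_files, ages) if age is None]
print(ages)
print("Names without a leading age:", unparsed)
```

Only the digits before the first underscore are used, so a title that itself contains numbers (like '16_1984.jpg') should still parse to the right age.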

Now we can plot the age feature.

df['age'].plot(kind= "hist", 
               bins=range(2,18),
               figsize=(24,10),
               xticks=range(2,18),
               fontsize=16);

Thankfully, the plot above has a nearly identical distribution to the full dataset (see this post), so all is good and we can continue with the sample.
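To put a number on "nearly identical", one option is to compare the share of each age in the sample with its share in the full dataset. Below is a minimal sketch with made-up age lists; the real values would come from df['age'] and from the full dataset in the earlier post, and age_shares and max_share_gap are names I invented for illustration:

```python
from collections import Counter

def age_shares(ages):
    """Map each age to its fraction of the total."""
    counts = Counter(ages)
    total = len(ages)
    return {age: count / total for age, count in counts.items()}

def max_share_gap(sample_ages, full_ages):
    """Largest absolute difference in share across all ages seen in either list."""
    sample, full = age_shares(sample_ages), age_shares(full_ages)
    all_ages = set(sample) | set(full)
    return max(abs(sample.get(a, 0.0) - full.get(a, 0.0)) for a in all_ages)

# Made-up data purely for illustration.
sample_ages = [2, 2, 8, 13]
full_ages = [2, 2, 2, 8, 8, 13, 13, 13]
gap = max_share_gap(sample_ages, full_ages)
print("Largest gap in share: {:.3f}".format(gap))
```

A small gap suggests the sample mirrors the full dataset; what counts as "small" is a judgment call for your own project.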

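def get_image_path(age):
    # Return the path of the first cover in df labeled with the given age
    full_path = df.loc[df['age']==int(age), 'full_path'].iloc[0]
    return full_path

# One cover per age from 2 to 17
random_images = [get_image_path(x) for x in range(2, 18)]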
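Despite its name, random_images holds the first cover found for each age rather than a random one. If a truly random pick per age is wanted, something like the following might work; the pick_random_per_age helper and the seed are my own assumptions, not part of the original notebook:

```python
import random
from collections import defaultdict

def pick_random_per_age(paths, ages, seed=42):
    """Pick one path at random for each distinct age (seeded for repeatability)."""
    by_age = defaultdict(list)
    for path, age in zip(paths, ages):
        by_age[age].append(path)
    rng = random.Random(seed)
    return {age: rng.choice(candidates) for age, candidates in sorted(by_age.items())}

# A few paths from the sample, standing in for df['full_path'] and df['age'].
paths = [
    "data/covers/2_goodnight-moon.jpg",
    "data/covers/2_corduroy.jpg",
    "data/covers/17_pretty-dead.jpg",
]
ages = [2, 2, 17]
picks = pick_random_per_age(paths, ages)
print(picks)
```

With the data frame above, df.groupby('age').sample(n=1) should give a similar result in recent versions of pandas.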

What do our covers look like?

We know the general shape of our target, but let's get a feel for what the features (i.e. the book covers) look like by using the IPyPlot package.

To do so, we convert the image paths and the target to NumPy arrays:

images = df['full_path'].to_numpy()
labels_int = df['age'].to_numpy()

and then pass them as arguments to the plot_class_representations function which will return the first instance of each of our targets.

In other words, the function will display the first book rated for 2-year-olds, 3-year-olds, 4-year-olds, and so on until all levels of the target are represented.

ipyplot.plot_class_representations(images=images, labels=labels_int, force_b64=True)

[Image grid: the first cover for each age from 2 to 17, starting with data/covers/2_ten-little-caterpillars.jpg and ending with data/covers/17_pretty-dead.jpg]

🤔 There seems to be a pattern. To see more covers for each age, we can use the plot_class_tabs function.

ipyplot.plot_class_tabs(images=images, labels=labels_int, force_b64=True)

[Tabbed image grid: up to 10 covers per age, indexed 0 to 9, for ages 2 through 17; age 16 shows 8 covers and age 17 shows only 1]

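What format are our covers in?

Finally, we can read each cover with OpenCV and record its height, width, and number of channels.

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams

width = []
height = []
channels = []
for image in image_files:
    # With its default flag, cv2.imread returns a 3-channel BGR array,
    # so channels will be 3 even for grayscale files; it returns None if the file can't be read
    img = cv2.imread(IMAGES_PATH+image)
    if img is None:
        raise ValueError("Could not read image: {}".format(image))
    # OpenCV arrays are shaped (height, width, channels)
    img_height, img_width, img_channels = img.shape
    height.append(img_height)
    width.append(img_width)
    channels.append(img_channels)
df['width'] = width
df['height'] = height
df['channels'] = channels
df.head()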
   files                                              age  width  height  channels
0  13_dance-of-thieves-book-1.jpg                     13   170    255     3
1  11_ways-to-live-forever.jpg                        11   170    255     3
2  13_this-time-will-be-different.jpg                 13   170    255     3
3  10_the-care-and-keeping-of-you-2-the-body-book...  10   170    255     3
4  8_moonpenny-island.jpg                             8    170    255     3
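The first five rows all share the same dimensions, but head() only shows a small slice of the sample. A quick way to check every image is to count the distinct (width, height, channels) combinations. Here is a minimal sketch using made-up shapes; with the data frame above, the values would come from the width, height and channels columns, and summarize_shapes is a name I made up:

```python
from collections import Counter

def summarize_shapes(widths, heights, channels):
    """Count how many images share each (width, height, channels) combination."""
    return Counter(zip(widths, heights, channels))

# Made-up values for illustration; the last image is deliberately a different size.
widths = [170, 170, 170, 200]
heights = [255, 255, 255, 300]
channels = [3, 3, 3, 3]

shape_counts = summarize_shapes(widths, heights, channels)
most_common_shape, n_images = shape_counts.most_common(1)[0]
print("{} distinct shapes; most common: {} ({} images)".format(
    len(shape_counts), most_common_shape, n_images))
```

If more than one shape turns up, the covers will likely need resizing before they can be fed to a model that expects a fixed input size.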

Summary

  • ☑ numeric data
  • ☑ categorical data
  • 🔲 images (book covers)

Two down; one to go!

Going forward, my key points to remember are:

What type of categorical data do I have?

There is a huge difference between ordered data (e.g. "bad", "good", "great") and truly nominal data that has no order or ranking, like different genres; just because I prefer science fiction to fantasy doesn't mean it is actually superior.

Are missing values really missing?

Several of the features had missing values which were, in fact, not truly missing; for example, the award and awards features were mostly blank for a very good reason: the book didn't win one of the four awards recognized by Common Sense Media.

In conclusion, both of the points above can be summarized simply as "be sure to get to know your data."

Happy coding!

Footnotes


2. Be sure to check out this excellent post by Jeff Hale for more examples on how to use this package



4. Big Thank You to Chaim Gluck for providing this tip